
    Sparse integrative clustering of multiple omics data sets

    High resolution microarrays and second-generation sequencing platforms are powerful tools to investigate genome-wide alterations in DNA copy number, methylation and gene expression associated with a disease. An integrated genomic profiling approach measures multiple omics data types simultaneously in the same set of biological samples. Such an approach provides an integrated data resolution that would not be available with any single data type. In this study, we use penalized latent variable regression methods for joint modeling of multiple omics data types to identify common latent variables that can be used to cluster patient samples into biologically and clinically relevant disease subtypes. We consider the lasso [J. Roy. Statist. Soc. Ser. B 58 (1996) 267-288], elastic net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 301-320] and fused lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005) 91-108] penalties to induce sparsity in the coefficient vectors, revealing important genomic features that contribute significantly to the latent variables. An iterative ridge regression is used to compute the sparse coefficient vectors. For model selection, a uniform design [Monographs on Statistics and Applied Probability (1994) Chapman & Hall] is used to seek "experimental" points scattered uniformly across the search domain, for efficient sampling of tuning parameter combinations. We compare our method to sparse singular value decomposition (SVD) and a penalized Gaussian mixture model (GMM) using both real and simulated data sets. The proposed method is applied to integrate genomic, epigenomic and transcriptomic data for subtype analysis in breast and lung cancer data sets. Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: 10.1214/12-AOAS578.
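
    The core computational idea, a sparsity-inducing penalty on the loadings of a latent variable model solved by an iterative ridge-type update, can be illustrated with a toy rank-one factorization. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a single data type and a single latent variable, uses plain soft-thresholding for the lasso penalty, and fixes the tuning values for brevity.

        import numpy as np

        def soft_threshold(v, lam):
            """Elementwise soft-thresholding (the lasso proximal step)."""
            return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

        def sparse_latent_factor(X, lam, ridge=1e-3, n_iter=100):
            """Toy sparse rank-one model X ~ z w^T: alternate a
            ridge-regularized update of the latent scores z with a
            soft-thresholded (lasso-like) update of the loadings w."""
            w = np.linalg.svd(X, full_matrices=False)[2][0]  # init loadings
            for _ in range(n_iter):
                z = X @ w / (w @ w + ridge)                  # latent scores
                w = soft_threshold(X.T @ z, lam) / (z @ z + ridge)
            return z, w

        rng = np.random.default_rng(0)
        true_w = np.r_[np.ones(5), np.zeros(45)]    # 5 informative features
        scores = rng.normal(size=(60, 1))
        X = scores @ true_w[None, :] + 0.1 * rng.normal(size=(60, 50))
        z, w = sparse_latent_factor(X, lam=6.0)
        print("nonzero loadings:", np.flatnonzero(w))  # ideally 0..4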

    Random lasso

    We propose a computationally intensive method, the random lasso method, for variable selection in linear models. The method consists of two major steps. In step 1, the lasso method is applied to many bootstrap samples, each using a set of randomly selected covariates; this step yields an importance measure for each covariate. In step 2, a procedure similar to step 1 is implemented, except that for each bootstrap sample a subset of covariates is randomly selected with unequal selection probabilities determined by the covariates' importance. The adaptive lasso may be used in the second step, with weights determined by the importance measures. The final set of covariates and their coefficients are determined by averaging the bootstrap results obtained from step 2. The proposed method alleviates some of the limitations of the lasso, elastic net and related methods noted especially in the context of microarray data analysis: it tends to remove highly correlated variables altogether or select them all, and it maintains maximal flexibility in estimating their coefficients, particularly with different signs; the number of selected variables is no longer limited by the sample size; and the resulting prediction accuracy is competitive with or superior to the alternatives. We illustrate the proposed method by extensive simulation studies. The proposed method is also applied to a glioblastoma microarray data analysis. Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: 10.1214/10-AOAS377.
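
    A compact sketch of the two-step procedure, assuming scikit-learn's Lasso as the base learner. The candidate-set sizes q1 and q2, the fixed penalty level alpha and the small probability floor are illustrative choices, not values from the paper, which tunes these quantities.

        import numpy as np
        from sklearn.linear_model import Lasso

        def random_lasso(X, y, B=200, q1=10, q2=10, alpha=0.1, seed=0):
            """Sketch of the two-step random lasso. Step 1: lasso on
            bootstrap samples, each restricted to q1 randomly drawn
            covariates, yields an importance measure per covariate.
            Step 2: repeat with q2 covariates drawn with probability
            proportional to importance; the final coefficients are
            averages over the step-2 bootstrap fits."""
            rng = np.random.default_rng(seed)
            n, p = X.shape

            def one_pass(probs, q):
                coefs = np.zeros(p)
                for _ in range(B):
                    rows = rng.integers(0, n, n)          # bootstrap rows
                    cols = rng.choice(p, size=q, replace=False, p=probs)
                    fit = Lasso(alpha=alpha).fit(X[np.ix_(rows, cols)], y[rows])
                    coefs[cols] += fit.coef_
                return coefs / B

            importance = np.abs(one_pass(np.full(p, 1.0 / p), q1))
            probs = importance + 1e-8     # floor so q2 covariates can be drawn
            return one_pass(probs / probs.sum(), q2)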

    Sparse linear discriminant analysis by thresholding for high dimensional data

    In many social, economic, biological and medical studies, one objective is to classify a subject into one of several classes based on a set of variables observed from the subject. Because the probability distribution of the variables is usually unknown, the classification rule is constructed from a training sample. The well-known linear discriminant analysis (LDA) works well when the number of variables used for classification is much smaller than the training sample size. Because of advances in technology, modern statistical studies often face classification problems in which the number of variables is much larger than the sample size, and LDA may perform poorly. We explore when and why LDA performs poorly and propose a sparse LDA that is asymptotically optimal under some sparsity conditions on the unknown parameters. To illustrate an application, we discuss an example of classifying human cancer into two classes of leukemia based on a set of 7,129 genes and a training sample of size 72. A simulation is also conducted to check the performance of the proposed method. Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: 10.1214/10-AOS870.
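
    The rule can be sketched in a few lines for the two-class case: hard-threshold the mean-difference vector and the off-diagonal entries of the pooled sample covariance, then form the usual linear discriminant. The threshold levels below are assumed inputs (the paper derives their rates from the sparsity conditions), and the pseudo-inverse is a convenience for the sketch rather than part of the method.

        import numpy as np

        def sparse_lda(X0, X1, t_mean, t_cov):
            """Toy two-class sparse LDA by thresholding: zero out small
            entries of the mean-difference vector and of the pooled
            sample covariance, then build the usual linear rule."""
            mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
            delta = mu1 - mu0
            delta[np.abs(delta) < t_mean] = 0.0       # sparse mean difference
            S = np.cov(np.vstack([X0 - mu0, X1 - mu1]), rowvar=False)
            off = np.abs(S) < t_cov
            np.fill_diagonal(off, False)              # keep the diagonal
            S[off] = 0.0                              # sparse covariance
            w = np.linalg.pinv(S) @ delta             # discriminant direction
            b = w @ (mu0 + mu1) / 2.0
            return lambda x: (x @ w > b).astype(int)  # predict class 0 or 1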

    Doubly Regularized REML for Estimation and Selection of Fixed and Random Effects in Linear Mixed-Effects Models

    The linear mixed-effects model (LMM) is widely used in the analysis of clustered or longitudinal data. In practice, inference on the structure of the random-effects component is of great importance, not only to yield proper interpretation of subject-specific effects but also to draw valid statistical conclusions. This inference task becomes significantly challenging when a large number of fixed effects and random effects are involved in the analysis. The difficulty of variable selection arises from the need to simultaneously regularize both the mean model and the covariance structure, with possible parameter constraints between the two. In this paper, we propose a novel doubly regularized restricted maximum likelihood method to select fixed and random effects simultaneously in the LMM. The Cholesky decomposition is invoked to ensure the positive-definiteness of the selected covariance matrix of random effects, and the selected random effects are invariant with respect to the ordering of predictors appearing in the Cholesky decomposition. We then develop a new algorithm that solves the related optimization problem effectively, with a computational cost comparable to that of the Newton-Raphson algorithm for MLE or REML in the LMM. We also investigate large-sample properties of the proposed method, including the oracle property. Both simulation studies and a data analysis are included for illustration.
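
    The Cholesky device is easy to illustrate: writing the random-effects covariance as D = LL^T with L lower triangular guarantees positive semidefiniteness for any L, and a group penalty on whole rows of L removes an entire random effect whenever a row is shrunk to zero. The sketch below shows only this reparameterization and penalty; it ignores the order-invariance refinement and the REML objective developed in the paper.

        import numpy as np

        def cov_from_cholesky(L):
            """D = L L^T is positive semidefinite for any lower-triangular
            L, so optimizing over L enforces a valid covariance."""
            return L @ L.T

        def row_group_penalty(L, lam):
            """Group-lasso penalty on whole rows of L: shrinking row i to
            zero zeroes row and column i of D, dropping random effect i."""
            return lam * sum(np.linalg.norm(L[i, : i + 1]) for i in range(len(L)))

        L = np.array([[1.0, 0.0, 0.0],
                      [0.5, 0.8, 0.0],
                      [0.0, 0.0, 0.0]])   # third random effect removed
        print(cov_from_cholesky(L))       # third row/column of D is zero
        print(row_group_penalty(L, lam=0.3))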

    Doubly Penalized Buckley-James Method for Survival Data with High-Dimensional Covariates

    Recent interest in cancer research focuses on predicting patients' survival by investigating gene expression profiles based on microarray analysis. We propose a doubly penalized Buckley-James method for the semiparametric accelerated failure time model to relate high-dimensional genomic data to censored survival outcomes, using a mixture of L1-norm and L2-norm penalties. Similar to the elastic-net method for the linear regression model with uncensored data, the proposed method performs automatic gene selection and parameter estimation, where highly correlated genes can be selected (or removed) together. The two-dimensional tuning parameter is determined by cross-validation and uniform design. The proposed method is evaluated by simulations and applied to the Michigan squamous cell lung carcinoma study.
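
    The Buckley-James iteration can be sketched as follows: censored log-times are replaced by their conditional expectations under a Kaplan-Meier estimate of the residual distribution, and a penalized fit is alternated with this imputation. This is a simplified illustration assuming scikit-learn's ElasticNet for the mixed L1/L2 penalty and fixed tuning parameters, rather than the cross-validation and uniform-design search described above.

        import numpy as np
        from sklearn.linear_model import ElasticNet

        def bj_impute(y, delta, pred):
            """Buckley-James step: replace each censored outcome by its
            conditional expectation under the Kaplan-Meier estimate of
            the residuals (delta = 1 observed, 0 censored)."""
            res = y - pred
            order = np.argsort(res)
            r, d = res[order], delta[order]
            at_risk = len(r) - np.arange(len(r))
            surv = np.cumprod(1.0 - d / at_risk)   # KM survival of residuals
            mass = np.r_[1.0, surv[:-1]] - surv    # KM mass at each residual
            ynew = y.copy()
            for i in np.where(delta == 0)[0]:
                tail = r > res[i]
                if mass[tail].sum() > 0:
                    ynew[i] = pred[i] + (mass * r)[tail].sum() / mass[tail].sum()
            return ynew

        def penalized_buckley_james(X, y, delta, alpha=0.1, l1_ratio=0.5,
                                    n_iter=20):
            """Alternate an elastic-net fit with Buckley-James imputation."""
            model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
            ystar = y.copy()
            for _ in range(n_iter):
                model.fit(X, ystar)
                ystar = bj_impute(y, delta, model.predict(X))
            return model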

    Simultaneous variable selection for joint models of longitudinal and survival outcomes

    Joint models of longitudinal and survival outcomes have been used with increasing frequency in clinical investigations. Correct specification of fixed and random effects is essential for practical data analysis, and simultaneous selection of variables in both the longitudinal and survival components functions as a necessary safeguard against model misspecification. However, variable selection in such models has not been studied, and no existing computational tools, to the best of our knowledge, have been made available to practitioners. In this article, we describe a penalized likelihood method with adaptive least absolute shrinkage and selection operator (ALASSO) penalty functions for simultaneous selection of fixed and random effects in joint models. To perform selection in the variance components of the random effects, we reparameterize the variance components using a Cholesky decomposition; in doing so, a group-shrinkage penalty function is introduced. To reduce the estimation bias resulting from penalization, we propose a two-stage selection procedure in which the magnitude of the bias is ameliorated in the second stage. The penalized likelihood is approximated by Gaussian quadrature and optimized by an EM algorithm. Simulation studies showed excellent selection results in the first stage and small estimation biases in the second stage. To illustrate, we analyzed a longitudinally observed clinical marker and patient survival in a cohort of patients with heart failure.
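
    The full joint likelihood with Gaussian quadrature is beyond a snippet, but the ALASSO penalty itself is simple to illustrate on a linear model: penalty weights are taken inversely proportional to initial coefficient estimates, which can be implemented by rescaling the design columns before a single ordinary lasso fit. The weight floor and tuning values below are illustrative, not values from the article.

        import numpy as np
        from sklearn.linear_model import Lasso, LinearRegression

        def adaptive_lasso(X, y, lam=0.05, gamma=1.0):
            """Toy ALASSO: each coefficient's L1 penalty is weighted by
            1 / |initial estimate|^gamma, implemented by rescaling the
            design columns before an ordinary lasso fit."""
            init = LinearRegression().fit(X, y).coef_
            w = 1.0 / (np.abs(init) ** gamma + 1e-8)  # adaptive weights
            fit = Lasso(alpha=lam).fit(X / w, y)      # lasso on rescaled columns
            return fit.coef_ / w                      # map back to original scale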